3.6 James-Stein Estimator

1 Gaussian Sequence Model

Recall the Gaussian sequence model $X \sim N_d(\theta, I_d)$. The goal is to estimate $\theta \in \mathbb{R}^d$ via $\delta(X)$ with low MSE: $\mathrm{MSE}(\theta; \delta) = \mathbb{E}_\theta \|\theta - \delta(X)\|^2$.
The model is more general than it appears. For instance, if $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} N_d(\theta, \sigma^2 I_d)$ for known $\sigma^2 > 0$, we could make a sufficiency reduction to obtain $Z = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^n X_i \sim N_d\!\left(\frac{\sqrt{n}}{\sigma}\theta, I_d\right)$, which is the sequence model with a rescaled mean parameter.

1.1 Bayes Estimators

If we introduce the Bayesian prior $\theta_i \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)$, then the Bayes estimator is $\frac{\tau^2}{1+\tau^2} X$.
We can view this as a special case of the generic linear shrinkage estimator $\delta_\zeta(X) = (1-\zeta)X$, where $\zeta \in [0,1]$ is a tuning parameter we will call the shrinkage parameter. Taking $\zeta = (1+\tau^2)^{-1}$ recovers the Bayes estimator.
If we aren't sure which $\zeta$ to use (e.g., we have a priori uncertainty about $\tau^2$), we can use hierarchical Bayes: $\delta(X) = (1 - \mathbb{E}[\zeta \mid X])X = \delta_{\hat\zeta_{\text{Bayes}}(X)}(X)$, so we are in effect estimating $\zeta$ from the whole data set and then plugging it in as a data-adaptive tuning parameter.

If $d \geq 3$, then the UMVUE for $\zeta$ is $\hat\zeta_{\text{UMVU}}(X) = \frac{d-2}{\|X\|^2}$, based on the fact that $Y \sim \chi^2_d = \mathrm{Gamma}\!\left(\frac{d}{2}, 2\right)$ with $d > 2$ implies $\mathbb{E}\!\left[\frac{1}{Y}\right] = \frac{1}{d-2}$.
Plugging in $\hat\zeta_{\text{UMVU}}$ results in an estimator called the James-Stein estimator: $\delta_{\text{JS}}(X) = \left(1 - \frac{d-2}{\|X\|^2}\right)X = \delta_{\hat\zeta_{\text{UMVU}}(X)}(X)$.
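As a quick numerical sketch (not part of the notes), the James-Stein estimator can be compared against the unbiased estimator $\delta(X) = X$ by Monte Carlo; the dimension $d$, the true mean $\theta$, and the seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rep = 10, 20000            # dimension and number of Monte Carlo replications
theta = np.full(d, 0.5)         # arbitrary true mean vector

X = rng.normal(theta, 1.0, size=(n_rep, d))       # draws of X ~ N_d(theta, I_d)
shrink = 1 - (d - 2) / np.sum(X**2, axis=1)       # factor 1 - (d-2)/||X||^2
delta_js = shrink[:, None] * X                    # James-Stein estimates

mse_mle = np.mean(np.sum((X - theta)**2, axis=1))        # risk of delta(X) = X, about d
mse_js = np.mean(np.sum((delta_js - theta)**2, axis=1))  # risk of James-Stein
print(mse_mle, mse_js)          # James-Stein should come out strictly smaller
```

For this $\theta$ the James-Stein risk comes out well below $d$, illustrating the dominance result discussed next.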

1.2 James-Stein Paradox

While the James-Stein estimator can be motivated as an empirical Bayes estimator, it performs well even without making any Bayesian assumptions at all.

For $d \geq 3$, the estimator $X$ is actually inadmissible as an estimator of $\theta$: $\mathrm{MSE}(\theta, \delta_{\text{JS}}) < \mathrm{MSE}(\theta, X)$ for all $\theta \in \mathbb{R}^d$.

This is surprising because $\delta_{\text{JS}}$ beats the UMVUE not just on average, but at every fixed value of $\theta$.

In fact, we can use an estimator shrinking towards any $\theta_0 \in \mathbb{R}^d$: $\tilde\delta(X) = \theta_0 + \left(1 - \frac{d-2}{\|X - \theta_0\|^2}\right)(X - \theta_0)$. This also dominates $\delta_0(X) = X$, because it is just the James-Stein estimator we would get by substituting $Y = X - \theta_0 \sim N_d(\mu, I_d)$ with $\mu = \theta - \theta_0$.
By the translation invariance of the Gaussian location model, the James-Stein estimator for $\mu$ based on $Y$ dominates $\hat\mu_0(Y) = Y$; translating back, $\tilde\delta(X) = \delta_{\text{JS}}(Y) + \theta_0$ dominates $\delta_0(X) = \hat\mu_0(Y) + \theta_0 = X$ as an estimator of $\theta$.

1.3 Linear Shrinkage Estimators

Even without introducing a Bayesian prior for $\theta$, we can motivate our linear shrinkage estimator purely from the perspective of trading bias for a reduction in variance. Calculating the MSE coordinate-wise,
$$\mathbb{E}_\theta[(\theta_i - \delta_i(X))^2] = \left(\theta_i - \mathbb{E}_\theta(1-\zeta)X_i\right)^2 + \mathrm{Var}_\theta\!\left((1-\zeta)X_i\right) = (\zeta\theta_i)^2 + (1-\zeta)^2, \quad \text{so} \quad \mathrm{MSE}(\theta; \delta_\zeta) = \zeta^2\|\theta\|^2 + d(1-\zeta)^2.$$
Setting $0 = \frac{d}{d\zeta}\mathrm{MSE}(\theta; \delta_\zeta) = 2\zeta\|\theta\|^2 - 2(1-\zeta)d$ gives $\zeta^*(\theta) = \frac{d}{d + \|\theta\|^2}$.
This looks similar to $\frac{1}{1+\tau^2}$ (the Bayes-optimal $\zeta$ under the Gaussian prior): under the prior, $\mathbb{E}\|\theta\|^2 = d\tau^2$, and substituting this gives $\frac{d}{d + d\tau^2} = \frac{1}{1+\tau^2}$.
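A small sanity check on this derivation (an illustration with an arbitrarily chosen $d$ and $\theta$): minimize $\zeta^2\|\theta\|^2 + d(1-\zeta)^2$ over a fine grid and compare the grid minimizer to the closed form $d/(d + \|\theta\|^2)$.

```python
import numpy as np

d = 5
theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # arbitrary mean vector
norm2 = np.sum(theta**2)                        # ||theta||^2

zeta_grid = np.linspace(0, 1, 100001)
mse = zeta_grid**2 * norm2 + d * (1 - zeta_grid)**2   # MSE(theta; delta_zeta)
zeta_star = d / (d + norm2)                            # closed-form minimizer

print(zeta_grid[np.argmin(mse)], zeta_star)            # the two should agree closely
```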

2 SURE

Theorem (Stein's Lemma)

Suppose $X \sim N(\theta, \sigma^2)$ and $h: \mathbb{R} \to \mathbb{R}$ is differentiable, with $\mathbb{E}|\dot h(X)| < \infty$. Then
$$\mathrm{Cov}(X, h(X)) = \mathbb{E}[(X - \theta)h(X)] = \sigma^2\,\mathbb{E}[\dot h(X)]. \tag{2.1}$$
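The identity (2.1) can be checked by Monte Carlo; the test function $h(x) = x^3$ (so $\dot h(x) = 3x^2$), the parameters, and the seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 1.0, 2.0
X = rng.normal(theta, sigma, size=2_000_000)

h = X**3                                  # h(x) = x^3, with derivative 3x^2
lhs = np.mean((X - theta) * h)            # estimates E[(X - theta) h(X)]
rhs = sigma**2 * np.mean(3 * X**2)        # estimates sigma^2 E[h'(X)]
print(lhs, rhs)                           # the two sides should agree closely
```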

Now consider the multivariate version. For a function $h: \mathbb{R}^d \to \mathbb{R}^d$, define the Jacobian matrix $Dh \in \mathbb{R}^{d \times d}$ by $(Dh(x))_{ij} = \frac{\partial h_i}{\partial x_j}(x)$, and the Frobenius norm of $A \in \mathbb{R}^{d \times d}$ as $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.

Theorem (Stein's Lemma, Multivariate)

Assume $X \sim N_d(\theta, \sigma^2 I_d)$ and $h: \mathbb{R}^d \to \mathbb{R}^d$ is differentiable with $\mathbb{E}\|Dh(X)\|_F < \infty$. Then
$$\mathbb{E}[(X - \theta)^T h(X)] = \sigma^2\,\mathbb{E}\,\mathrm{tr}(Dh(X)) = \sigma^2 \sum_{i=1}^d \mathbb{E}\,\frac{\partial h_i}{\partial x_i}(X). \tag{2.2}$$

Apply Stein's lemma to $h(x) = x - \delta(x)$ (note that we now assume $X \sim N_d(\theta, \sigma^2 I_d)$):
$$\mathrm{MSE}(\theta; \delta) = \mathbb{E}_\theta\|\delta(X) - \theta\|^2 = \mathbb{E}_\theta\|X - h(X) - \theta\|^2 = \mathbb{E}_\theta\|X - \theta\|^2 + \mathbb{E}_\theta\|h(X)\|^2 - 2\,\mathbb{E}_\theta[(X - \theta)^T h(X)] = \sigma^2 d + \mathbb{E}_\theta\|h(X)\|^2 - 2\sigma^2\,\mathbb{E}_\theta\,\mathrm{tr}(Dh(X)).$$
So if $\sigma^2$ is known, we obtain the unbiased estimator
$$\widehat{\mathrm{MSE}}(X) = \sigma^2 d + \|h(X)\|^2 - 2\sigma^2\,\mathrm{tr}(Dh(X)). \tag{2.3}$$
We call it Stein's Unbiased Risk Estimator (SURE).
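As a sketch of SURE in action (not part of the notes, with arbitrary $d$, $\theta$, and seed): for the James-Stein estimator with $\sigma^2 = 1$, $h(x) = \frac{d-2}{\|x\|^2}x$, and a short calculation gives $\|h(X)\|^2 = \mathrm{tr}\,Dh(X) = \frac{(d-2)^2}{\|X\|^2}$, so (2.3) reduces to $\widehat{\mathrm{MSE}}(X) = d - \frac{(d-2)^2}{\|X\|^2}$. We can check unbiasedness by comparing its average to the Monte Carlo risk.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_rep = 10, 50000
theta = np.linspace(-1, 1, d)                 # arbitrary true mean vector

X = rng.normal(theta, 1.0, size=(n_rep, d))   # draws of X ~ N_d(theta, I_d)
norm2 = np.sum(X**2, axis=1)                  # ||X||^2 for each replication

delta_js = (1 - (d - 2) / norm2)[:, None] * X
true_loss = np.mean(np.sum((delta_js - theta)**2, axis=1))   # Monte Carlo MSE

# SURE for James-Stein with sigma^2 = 1: h(x) = (d-2) x / ||x||^2 gives
# ||h(X)||^2 = tr Dh(X) = (d-2)^2 / ||X||^2, hence SURE = d - (d-2)^2 / ||X||^2
sure = np.mean(d - (d - 2)**2 / norm2)
print(true_loss, sure)                        # the two averages should agree
```

Note that SURE here depends on $X$ alone, not on the unknown $\theta$, which is what makes it usable for data-driven tuning.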

3 Risk of the James-Stein Estimator

Now we calculate the risk of the James-Stein estimator $\delta_{\text{JS}}(X) = \left(1 - \frac{d-2}{\|X\|^2}\right)X$.